Welcome to “How to Create Jitter plots in R with ggplot2!” In this scenario, we will cover how to build a jitter plot with a numeric and categorical variable.
This scenario assumes you’ve done some data wrangling with tidyr and dplyr, and data visualization with ggplot2.
It’s best to start a project off with a “view of the forest from outside the trees.” The technical term for this is data lineage, which:
“Includes the data origin, what happens to it, and where it moves over time.”
Having a “bird’s eye view” of the data ensures there weren’t any problems with exporting or importing. Data lineage also means understanding where the data is coming from (e.g., a relational database, API, flat .csv files, etc.).
Knowing some of the technical details behind a dataset lets us frame the questions or problems we’re trying to tackle. In this scenario, we will use tabular data (i.e., spreadsheets). Tabular data organizes information into columns and rows.
Let’s load some data and get started!
The package we’ll use to view the entire dataset with R is skimr. We will install and load the following packages:
install.packages(c("tidyverse", "skimr"))
library(tidyverse)
library(skimr)
IMDB makes multiple datasets available for download. We’ve combined the title.ratings.tsv, name.basics.tsv, and title.principals.tsv datasets into the ImdbData dataset with the following columns:
tconst = alphanumeric unique identifier of the title (used for joining)
nconst = alphanumeric unique identifier of the name/person (used for joining)
category = the category of job that person was in
primaryName = name by which the person is most often credited
birthYear = in YYYY format
averageRating = weighted average of all the individual user ratings
numVotes = number of votes the title has received
primaryTitle = the more popular title/the title used by the filmmakers on promotional materials at the point of release
originalTitle = original title, in the original language
isAdult = non-adult title or adult title
startYear = represents the release year of a title. In the case of TV Series, it is the series start year.
runtimeMinutes = primary runtime of the title, in minutes
genres = includes up to three genres associated with the title.
age_lead = age of actor/actress at time of film (age_lead = startYear - birthYear)
# click to execute code
ImdbData <- readr::read_csv("https://bit.ly/2O2ZKDC")
glimpse(ImdbData)
#> Rows: 136,925
#> Columns: 14
#> $ tconst <chr> "tt0000574", "tt0000630", "tt0000886", "tt0001101", "…
#> $ nconst <chr> "nm0846887", "nm0624446", "nm0609814", "nm0923594", "…
#> $ category <chr> "actress", "actress", "actor", "actor", "actor", "act…
#> $ primaryName <chr> "Elizabeth Tait", "Fernanda Negri Pouget", "Jean Moun…
#> $ birthYear <dbl> 1879, 1889, 1841, 1870, 1869, 1844, 1891, 1879, 1867,…
#> $ averageRating <dbl> 6.1, 3.2, 5.0, 5.2, 4.7, 5.8, 5.5, 3.9, 4.6, 5.3, 4.2…
#> $ numVotes <dbl> 609, 11, 23, 13, 10, 22, 47, 11, 13, 28, 14, 13, 25, …
#> $ primaryTitle <chr> "The Story of the Kelly Gang", "Hamlet", "Hamlet, Pri…
#> $ originalTitle <chr> "The Story of the Kelly Gang", "Amleto", "Hamlet", "A…
#> $ isAdult <chr> "non-adult title", "non-adult title", "non-adult titl…
#> $ startYear <dbl> 1906, 1908, 1910, 1910, 1910, 1912, 1910, 1911, 1910,…
#> $ runtimeMinutes <dbl> 70, NA, NA, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, N…
#> $ genres <chr> "Biography,Crime,Drama", "Drama", "Drama", "\\N", "Cr…
#> $ age_lead <dbl> 27, 19, 69, 40, 41, 68, 19, 32, 43, 28, 52, 39, 23, 4…
In the following, we build a skimr::skim() of the ImdbData dataset:
# click to execute code
SkimImdbData <- skimr::skim(ImdbData)
summary(SkimImdbData)
| Name | ImdbData |
| Number of rows | 136925 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
First, we will review the character variables:
# click to execute code
SkimImdbData %>%
  skimr::yank("character")
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| tconst | 0 | 1 | 9 | 10 | 0 | 136925 | 0 |
| nconst | 0 | 1 | 9 | 10 | 0 | 41765 | 0 |
| category | 0 | 1 | 5 | 7 | 0 | 2 | 0 |
| primaryName | 0 | 1 | 2 | 38 | 0 | 41658 | 0 |
| primaryTitle | 0 | 1 | 1 | 196 | 0 | 124228 | 0 |
| originalTitle | 0 | 1 | 1 | 196 | 0 | 128171 | 0 |
| isAdult | 0 | 1 | 11 | 15 | 0 | 2 | 0 |
| genres | 0 | 1 | 2 | 31 | 0 | 1056 | 0 |
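All eight character variables are complete (n_missing = 0). Note, however, that the glimpse() output shows genres contains the placeholder \N for some titles, which skim() does not count as missing.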
The number of individual responses (n_unique) for each character variable is a good source for sanity checks. For example, the largest number of unique values belongs to the title id variable (tconst = 136925), and this is identical to the number of rows in the dataset. The next largest number belongs to the originalTitle variable (128171), and the documentation tells us this variable is the title for the film in its original language. By itself, this number doesn’t tell us much, but we can see the next largest number (124228) is the film’s primaryTitle, and it makes sense that the number of unique responses for these two variables is almost the same.
It also makes sense that the n_unique for actor/actress (nconst = 41765) is close to that of the actor/actress primaryName (41658). There should be far more titles (originalTitle or primaryTitle) than genres, and there are: genres has only 1056 unique values.
Finally, we can see the two binary variables we read about above (category and isAdult) only list 2 unique values (in n_unique), so it appears we imported these variables correctly.
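The same kind of sanity check can be done directly with dplyr. The following is a minimal sketch on a small, hypothetical data frame (the values in toy are invented for illustration and are not taken from ImdbData); it counts distinct values per column and checks that the identifier column is unique per row:

```r
library(dplyr)

# Hypothetical miniature of the title/person table (values invented)
toy <- data.frame(
  tconst   = c("tt001", "tt002", "tt003", "tt004"),
  nconst   = c("nm01", "nm02", "nm01", "nm03"),
  category = c("actor", "actress", "actor", "actress")
)

# Count distinct values per column, analogous to skimr's n_unique
unique_counts <- toy %>%
  summarise(across(everything(), n_distinct))

# An identifier column should have as many unique values as there are rows
id_is_unique <- unique_counts$tconst == nrow(toy)
```

If id_is_unique were FALSE for tconst in the real data, that would suggest duplicated titles introduced during a join or import.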
Next, we will review the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in ImdbData:
# click to execute code
SkimImdbData %>%
  skimr::focus(numeric.mean, numeric.sd,
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist) %>%
  skimr::yank("numeric")
Variable type: numeric
| skim_variable | mean | sd | p0 | p50 | p100 | hist |
|---|---|---|---|---|---|---|
| birthYear | 1946.85 | 27.79 | 1839 | 1950.0 | 2015 | ▁▂▆▇▂ |
| averageRating | 6.00 | 1.16 | 1 | 6.1 | 10 | ▁▂▇▆▁ |
| numVotes | 6015.29 | 43514.99 | 10 | 125.0 | 2334927 | ▇▁▁▁▁ |
| startYear | 1985.28 | 26.46 | 1906 | 1990.0 | 2021 | ▁▂▃▅▇ |
| runtimeMinutes | 97.14 | 24.13 | 2 | 94.0 | 1500 | ▇▁▁▁▁ |
| age_lead | 38.43 | 12.95 | 1 | 36.0 | 98 | ▁▇▅▁▁ |
Let’s take a look:
The average birthYear is 1947, which is plausible considering the date range for movies in the IMDB (1906 - 2021).
The average movie rating is a 6.00, which can be a little confusing considering IMDB’s rating scale. Still, we can feel confident the data isn’t skewed because the mean and median (p50) are relatively close to each other.
The number of votes (numVotes) is the most skewed variable: it ranges from 10 to 2334927, and its mean (6015.29) is far above its median (125).
The startYear for the movie has an average of 1985, and the counts increase steadily from 1906 to 2021, which makes sense because more films are being made every year.
The average length of each movie in ImdbData is 97.1 minutes (runtimeMinutes). But we can also see from the hist that the range for runtimeMinutes includes some very low and high values (p0 = 2 and p100 = 1500).
The actor/actress’s average age is 38.4, with a low of 1 and a high of 98 (both plausible).
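Comparing the mean with the median is a quick way to gauge skew, as with averageRating (6.00 vs. 6.1) and numVotes (6015.29 vs. 125) above. A minimal sketch with a hypothetical vector of vote counts (the values are invented for illustration):

```r
# Hypothetical vote counts with one very popular title (values invented)
votes <- c(10, 33, 125, 666, 250000)

vote_mean   <- mean(votes)
vote_median <- median(votes)

# A mean far above the median signals right skew
ratio <- vote_mean / vote_median
```

A ratio near 1 suggests a roughly symmetric distribution; a ratio far above 1 suggests a long right tail.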
We will proceed under the assumption that our stakeholders asked us to help explain the relationship between the average rating a movie received (averageRating) and the number of votes that went into the score (numVotes).
There are quite a few years in this dataset, so instead of plotting each year we will group titles into decades. To do this, we need a categorical variable built from the startYear variable. The cut() function is handy because we can supply the number of intervals we want to split the numeric startYear variable into (12 in this case). Note that breaks = 12 produces 12 equal-width intervals across the observed range (1906 to 2021), so each interval only approximates a calendar decade. We will also create some clear labels for this categorical variable with the labels argument and make sure the resulting factor is ordered.
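As an aside, the sketch below shows how cut() behaves with a number of breaks versus explicit break points. The years vector is hypothetical, and the explicit-decade variant is an alternative shown for comparison, not the approach used in this scenario:

```r
# Hypothetical release years spanning the same range as startYear
years <- c(1906, 1915, 1950, 1989, 1990, 2021)

# breaks = 12 asks cut() for 12 equal-width intervals over the observed range
equal_width <- cut(years, breaks = 12)
levels(equal_width)

# Explicit break points give true calendar decades instead
decade <- cut(years,
              breaks = seq(1900, 2030, by = 10),
              right = FALSE,  # intervals are [1900, 1910), [1910, 1920), ...
              labels = paste0(seq(1900, 2020, by = 10), "s"),
              ordered_result = TRUE)
decade
```

With explicit breaks, 1989 and 1990 are guaranteed to land in different decades, which equal-width intervals do not promise.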
We check our new factor variable with the fct_count() from the forcats package:
# click to execute code
ImdbData <- ImdbData %>%
  mutate(year_cat10 = cut(x = startYear,
                          breaks = 12,
                          labels = c("1910s", "1920s", "1930s",
                                     "1940s", "1950s", "1960s",
                                     "1970s", "1980s", "1990s",
                                     "2000s", "2010s", "2020s"),
                          # full argument name for cut(); returns an ordered factor
                          ordered_result = TRUE))
# check the count of our factor levels
fct_count(f = ImdbData$year_cat10, sort = TRUE)
We want to examine how the numVotes variable changed over time (year_cat10). We want each decade on the x axis and the numVotes for each film on the y axis. Let’s review the numVotes variable below with skimr::skim():
ImdbData$numVotes %>% skimr::skim()
| Name | Piped data |
| Number of rows | 136925 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 6015.29 | 43514.99 | 10 | 33 | 125 | 666 | 2334927 | ▇▁▁▁▁ |
We can see the values for this variable are concentrated, or skewed, towards 0. However, we know enough information to build our labels:
labs_numvotes_yearcat10 <- labs(
  title = "Number of IMDB review votes by decade",
  subtitle = "Internet Movie Database (IMDB)",
  caption = "https://www.imdb.com/",
  x = "Decade",
  y = "Number of votes")
We’re going to view the distribution of numVotes across each decade in year_cat10 with ggplot2::geom_jitter(). The geom_jitter() geom is similar to geom_point(), except it adds “a small amount of random variation to the location of each point.” We will set the following arguments inside geom_jitter() to help demonstrate how it works:
size = 0.9 = this controls how small/large the points will be
width = 0.25 = determines how wide we want the points to ‘jitter’ (the default value is .40, so we’re decreasing it slightly)
alpha = 1/6 = the alpha controls the opacity of the points (lower values are more transparent)
show.legend = FALSE = remove the legend (it’s labeled across the x axis)
# click to execute code
gg_step3_jitter_01 <- ImdbData %>%
  ggplot(aes(x = year_cat10,
             y = numVotes,
             fill = year_cat10)) +
  geom_jitter(size = 0.9,
              width = 0.25,
              alpha = 1/6,
              show.legend = FALSE) +
  # hide the fill legend ("none" replaces the deprecated FALSE)
  guides(fill = "none") +
  # add labels
  labs_numvotes_yearcat10
# save
# ggsave(plot = gg_step3_jitter_01,
#        filename = "gg-step3-jitter-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step3_jitter_01
If you save the plot with the ggsave() call above, open gg-step3-jitter-01.png in the VS Code IDE above the Terminal console to view the graph.
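As an aside, the effect of the width argument can be inspected numerically. The sketch below uses a small hypothetical data frame (values invented for illustration) and ggplot2::layer_data() to pull out the positions that will actually be drawn. It sets height = 0 so only the horizontal position moves; the plots in this scenario leave height at its default, so their points also shift slightly in the vertical direction.

```r
library(ggplot2)

set.seed(42)  # jitter is random; a seed makes the sketch reproducible

# Hypothetical data: three groups with a few numeric values each
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  value = c(1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6)
)

p <- ggplot(toy, aes(x = group, y = value)) +
  geom_jitter(width = 0.25, height = 0)  # jitter horizontally only

# layer_data() returns the positions ggplot2 will actually draw
drawn <- layer_data(p)

# Discrete groups sit at x = 1, 2, 3; each point is shifted by at most `width`
x_drawn <- as.numeric(drawn$x)
offsets <- x_drawn - round(x_drawn)
range(offsets)
```

Every offset falls between -0.25 and 0.25, so a smaller width keeps the points of each decade in a tighter column.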
Notice how there are a handful of extreme points above 1500000?
Data with extreme values like this can be removed or transformed. If we have a reason to exclude this data, we can filter the data to only the range we want in the graph (say Number of votes < 1500000), and then note this in the plot.
Run the following code to see what this would look like:
# click to execute code
labs_jitter_numvote_yearcat10_02 <- labs(
  title = "Number of IMDB review votes by decade",
  subtitle = "Internet Movie Database (IMDB)",
  caption = "*Number of votes < 1500000; https://www.imdb.com",
  x = "Decade",
  y = "Number of votes")
gg_step4_jitter_02 <- ImdbData %>%
  filter(numVotes < 1500000) %>%
  ggplot(aes(x = year_cat10,
             y = numVotes)) +
  geom_jitter(size = 0.9,
              width = 0.25,
              alpha = 1/6,
              show.legend = FALSE) +
  labs_jitter_numvote_yearcat10_02
# save
# ggsave(plot = gg_step4_jitter_02,
#        filename = "gg-step4-jitter-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step4_jitter_02
If you save the plot with the ggsave() call above, open gg-step4-jitter-02.png in the VS Code IDE above the Terminal console to view the graph.
Now the extreme points above 1500000 have been removed.
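When values are excluded, it helps to record how many rows were dropped so the number can be reported alongside the plot. A minimal sketch on hypothetical vote counts (the toy data frame is invented for illustration; only the 1500000 threshold comes from this scenario):

```r
library(dplyr)

# Hypothetical vote counts (values invented for illustration)
toy <- data.frame(numVotes = c(10, 125, 666, 90000, 1600000, 2300000))

threshold <- 1500000

# Rows that the filter keeps
kept <- toy %>% filter(numVotes < threshold)

# Number of rows excluded by the filter, worth reporting with the plot
n_removed <- nrow(toy) - nrow(kept)
```

The same two lines applied to ImdbData would give the count of titles hidden from the filtered graph.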
Another option is to transform the axis, which we can do with ggplot2::scale_y_log10(). This function changes the y axis scale using a base-10 log transformation.
We will also use the handy label_log10 function developed by Claus Wilke:
# click to execute code
# load label_log10 function
source("https://bit.ly/35Ywt2q")
# create new labels
labs_jitter_numvote_yearcat10_03 <- labs(
  title = "Number of IMDB review votes by decade",
  subtitle = "Internet Movie Database (IMDB)",
  caption = "*Number of votes log10 transformed; https://www.imdb.com",
  x = "Decade",
  y = "log10(Number of votes)")
# create new graph
gg_step5_jitter_03 <- ImdbData %>%
  ggplot(aes(x = year_cat10,
             y = numVotes)) +
  geom_jitter(size = 0.9,
              width = 0.25,
              alpha = 1/6,
              show.legend = FALSE) +
  scale_y_log10(labels = label_log10) +
  labs_jitter_numvote_yearcat10_03
# save
# ggsave(plot = gg_step5_jitter_03,
#        filename = "gg-step5-jitter-03.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step5_jitter_03
If you save the plot with the ggsave() call above, open gg-step5-jitter-03.png in the VS Code IDE above the Terminal console to view the graph.
We can see the scale_y_log10() transformation spreads the points out more uniformly across the y axis, but it also makes the number of votes more challenging to interpret because equal distances on the axis now represent tenfold changes.
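To see why the log scale spreads the points out, the sketch below applies log10() to the numVotes quantiles reported by skim() earlier in this scenario. The variable names are illustrative only:

```r
# Quantiles of numVotes taken from the skim() output earlier in this scenario
votes <- c(p0 = 10, p25 = 33, p50 = 125, p75 = 666, p100 = 2334927)

# On the raw scale the maximum dwarfs everything else
raw_share <- votes / max(votes)

# On the log10 scale each step of 1 is a tenfold increase in votes
log_votes <- log10(votes)
round(log_votes, 2)
```

On the raw scale, three quarters of the titles sit in well under 0.1% of the axis; on the log10 scale the same quantiles are spread across roughly 1 to 6.4.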
Regardless of choice (removal of the extreme values or transforming the scale), we want to communicate these changes to our stakeholders. We also want to include them in any reports or write-ups to avoid giving readers a distorted view of the underlying variable.
Read more about transforming a scale in R for Data Science.
In this scenario we covered how to:
skim() variables to get summary statistics
mutate()
ggplot2::geom_jitter()
label_log10() function
ggplot2::ggsave()
We’ve concluded the “How to Create Jitter plots in R with ggplot2” scenario! Thank you for completing this scenario, and be sure to check out the other scenarios on the O’Reilly Learning Platform.